This document presents our group project for the BST209 course: a workflow for training and testing a diabetes mellitus prediction model in ICU patients. The project outline is as follows:
Project importance: Why is it important to predict diabetes mellitus in ICU patients?
When patients are admitted to the ICU, providers are often unaware of their past medical history. Our group’s mission was therefore to build a model that predicts a diagnosis of diabetes mellitus at the time of ICU admission, which would support clinical decision making and improve patient care.
First, we loaded the R packages necessary for our project.
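The exact package list is not reproduced in the document; based on the functions called throughout (`select`, `kable`, `createDataPartition`, `CreateTableOne`, `prediction`/`performance`, `ggplot`), a plausible set would be:

```r
# Packages this analysis appears to rely on (inferred, not from the
# original source): tidyverse for wrangling, caret + randomForest for
# modeling, ROCR for ROC/AUC, tableone and knitr for tables, ggplot2
# for plots.
pkgs <- c("tidyverse", "caret", "randomForest", "ROCR",
          "tableone", "knitr", "ggplot2")
# Uncomment to install anything missing, then load everything:
# install.packages(setdiff(pkgs, rownames(installed.packages())))
# invisible(lapply(pkgs, library, character.only = TRUE))
```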
We accessed the data from the WiDS Datathon project page on PhysioNet, downloading ‘training_v2.csv’ from the “Files” section of the project.
Next we loaded in our data and started working with it.
We included eight predictors in our model: age, bmi, gender, ethnicity, glucose_apache, h1_glucose_max, d1_glucose_max, and wbc_apache.
# Select subset of variables from original data
reduced_data <- gossis %>%
  select(bmi, age, gender, glucose_apache, diabetes_mellitus,
         h1_glucose_max, d1_glucose_max, ethnicity, wbc_apache)
head(reduced_data) %>% kable()
| bmi | age | gender | glucose_apache | diabetes_mellitus | h1_glucose_max | d1_glucose_max | ethnicity | wbc_apache |
|---|---|---|---|---|---|---|---|---|
| 22.73 | 68 | M | 168 | 1 | NA | 168 | Caucasian | 14.1 |
| 27.42 | 77 | F | 145 | 1 | 145 | 145 | Caucasian | 12.7 |
| 31.95 | 25 | F | NA | 0 | NA | NA | Caucasian | NA |
| 22.64 | 81 | F | 185 | 0 | NA | 185 | Caucasian | 8.0 |
| NA | 19 | M | NA | 0 | NA | NA | Caucasian | NA |
| 27.56 | 67 | M | 156 | 1 | NA | 156 | Caucasian | 10.9 |
We also defined our outcome variable and removed rows with unknown outcomes.
# Set the outcome variable here
reduced_data$outcome_variable <- as.factor(reduced_data$diabetes_mellitus)
reduced_data <- subset(reduced_data, select = -c(diabetes_mellitus))
# Check number of rows
nrow(reduced_data)
## [1] 91713
# For simplicity, we dropped all rows with missing outcomes
reduced_data <- drop_na(reduced_data, any_of("outcome_variable"))
# Check number of rows after removing rows with missing outcomes
nrow(reduced_data)
## [1] 90998
Encoding our categorical variables.
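The encoding code itself is not shown in the document; judging by the tables that follow (gender appears as 0/1, with M = 1), it was presumably along these lines. A minimal sketch on toy data:

```r
# Hypothetical illustration of the encoding step: gender becomes a 0/1
# indicator (M = 1, F = 0, matching the later output tables), and
# ethnicity stays a factor so the model can dummy-code it.
toy <- data.frame(gender = c("M", "F", "F"),
                  ethnicity = c("Caucasian", "Asian", "Caucasian"))
toy$gender    <- ifelse(toy$gender == "M", 1, 0)
toy$ethnicity <- factor(toy$ethnicity)
toy$gender    # 1 0 0
```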
Creating the training and test datasets.
# Setting the random number seed for reproducibility
set.seed(1)
# Creating data partition using the outcome variable
train_index <- createDataPartition(reduced_data$outcome_variable,
times = 1, p = 0.8, list = FALSE)
# Splitting the data into train and test sets
train_set <- reduced_data[train_index, ]
test_set <- reduced_data[-train_index, ]
head(train_set)
## # A tibble: 6 × 9
## bmi age gender glucose_apache h1_glucose_max d1_glucose_max ethnicity
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <chr>
## 1 22.7 68 1 168 NA 168 Caucasian
## 2 32.0 25 0 NA NA NA Caucasian
## 3 22.6 81 0 185 NA 185 Caucasian
## 4 57.4 59 0 197 197 197 Caucasian
## 5 NA 70 1 164 NA 129 Caucasian
## 6 NA 45 1 380 365 365 Caucasian
## # … with 2 more variables: wbc_apache <dbl>, outcome_variable <fct>
head(test_set)
## # A tibble: 6 × 9
## bmi age gender glucose_apache h1_glucose_max d1_glucose_max ethnicity
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <chr>
## 1 27.4 77 0 145 145 145 Caucasian
## 2 NA 19 1 NA NA NA Caucasian
## 3 27.6 67 1 156 NA 156 Caucasian
## 4 25.7 50 1 134 134 134 <NA>
## 5 38.2 81 1 120 121 121 Caucasian
## 6 23.4 79 0 175 NA 175 Caucasian
## # … with 2 more variables: wbc_apache <dbl>, outcome_variable <fct>
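createDataPartition samples within each level of the outcome, so the 80/20 split preserves the outcome's class balance. A base-R sketch of the same idea on toy data:

```r
# Stratified 80% sample: draw separately within each outcome class so
# both classes keep their original proportions (toy data, fixed seed).
set.seed(1)
y   <- factor(rep(c(0, 1), times = c(80, 20)))
idx <- unlist(lapply(split(seq_along(y), y),
                     function(i) sample(i, size = round(0.8 * length(i)))))
table(y[idx])   # 64 zeros, 16 ones: the 80/20 balance is preserved
```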
library(tableone)
allvars <- c("bmi", "age", "gender", "glucose_apache", "h1_glucose_max",
             "d1_glucose_max", "ethnicity", "wbc_apache")
catvars <- c("gender")
table_gossis <- CreateTableOne(vars = allvars, data = train_set,
factorVars = catvars,
strata = "outcome_variable")
kableone(table_gossis, caption = "Summary statistics of the training set")
| | No diabetes (0) | Diabetes (1) | p | test |
|---|---|---|---|---|
| n | 56405 | 16394 | ||
| bmi (mean (SD)) | 28.41 (7.86) | 31.82 (9.00) | <0.001 | |
| age (mean (SD)) | 61.57 (17.41) | 64.62 (14.45) | <0.001 | |
| gender = 1 (%) | 30270 (53.7) | 9044 (55.2) | 0.001 | |
| glucose_apache (mean (SD)) | 141.36 (73.15) | 218.80 (112.45) | <0.001 | |
| h1_glucose_max (mean (SD)) | 147.92 (73.63) | 214.95 (118.80) | <0.001 | |
| d1_glucose_max (mean (SD)) | 154.69 (68.89) | 239.15 (104.81) | <0.001 | |
| ethnicity (%) | <0.001 | |||
| African American | 5620 (10.1) | 1977 (12.2) | ||
| Asian | 647 ( 1.2) | 234 ( 1.4) | ||
| Caucasian | 43997 (79.3) | 12106 (74.6) | ||
| Hispanic | 2279 ( 4.1) | 748 ( 4.6) | ||
| Native American | 419 ( 0.8) | 218 ( 1.3) | ||
| Other/Unknown | 2509 ( 4.5) | 953 ( 5.9) | ||
| wbc_apache (mean (SD)) | 12.11 (6.97) | 12.19 (6.72) | 0.243 |
To deal with the NA values in our data, we replaced missing values in each numeric predictor with the median of that column in train_set.
Note that it is important that we do this after splitting into training and test sets. If we imputed values using test data, we would risk leaking information about our test set into our training set (“data leakage”).
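In miniature, the rule is: compute the median on training data only, then apply it to both sets (toy numbers, for illustration):

```r
# Train-only median imputation on hypothetical values: the test set never
# contributes to the statistic, so no information leaks across the split.
train_bmi <- c(20, 25, NA, 30)
test_bmi  <- c(NA, 40)
m <- median(train_bmi, na.rm = TRUE)   # 25, from training data alone
train_bmi[is.na(train_bmi)] <- m
test_bmi[is.na(test_bmi)]   <- m
test_bmi   # 25 40
```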
predictors <- c("age", "bmi", "gender", "glucose_apache", "h1_glucose_max", "d1_glucose_max", "ethnicity", "wbc_apache")
for (col in predictors) {
  # median() is only defined for numeric columns, so categorical
  # predictors (e.g. ethnicity) are left untouched here
  if (!is.numeric(train_set[[col]])) next
  train_median <- median(train_set[[col]], na.rm = TRUE)
  test_set[[col]][is.na(test_set[[col]])]   <- train_median
  train_set[[col]][is.na(train_set[[col]])] <- train_median
}
Since tree-based models do not require normalization, we skipped that step.
We used caret’s train() with method = "rf", which calls the randomForest library, to build a random forest model. Other model types can be substituted by changing the method argument, for example method = "svmPoly" or method = "glm".
# Define forest tuning parameters
# Repeated cross-validation with 2 folds
control <- trainControl(method = "repeatedcv", number = 2)
# Number of variables tried at each split (square root of the number of
# columns, a common default for classification forests)
mtry <- sqrt(ncol(train_set))
# Grid search over this single mtry value
tunegrid <- expand.grid(.mtry = mtry)
# Create classification forest using all eight predictors
forest <- train(outcome_variable ~ age + bmi + gender + glucose_apache + h1_glucose_max + d1_glucose_max + ethnicity + wbc_apache,
data = train_set,
method = "rf",
metric = "Accuracy",
tuneGrid = tunegrid,
trControl = control)
We used caret’s varImp() to compute variable importance (see the caret documentation); scores are scaled so that the most important variable is 100.
| Variable | Overall |
|---|---|
| age | 47.0295952 |
| bmi | 67.4601747 |
| gender | 5.8278661 |
| glucose_apache | 80.2409013 |
| h1_glucose_max | 44.6706070 |
| d1_glucose_max | 100.0000000 |
| ethnicityAsian | 0.1917897 |
| ethnicityCaucasian | 4.1155818 |
| ethnicityHispanic | 1.2941783 |
| ethnicityNative American | 0.0000000 |
| ethnicityOther/Unknown | 1.4721757 |
| wbc_apache | 50.3555599 |
We can see that d1_glucose_max and glucose_apache had the most predictive power, followed by bmi.
The first rows of the imputed test set:
## # A tibble: 5 × 8
## age bmi gender glucose_apache h1_glucose_max d1_glucose_max ethnicity
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <chr>
## 1 77 27.4 0 145 145 145 Caucasian
## 2 19 27.7 1 133 140 150 Caucasian
## 3 67 27.6 1 156 140 156 Caucasian
## 4 50 25.7 1 134 134 134 Caucasian
## 5 81 38.2 1 120 121 121 Caucasian
## # … with 1 more variable: wbc_apache <dbl>
#### Original results
# Keep a copy of the full test set for the subgroup analyses below
original_test_set <- test_set
# Make predictions on test set
forest_pred <- predict(forest,
newdata = test_set,
type = "raw")
# Combine unseen patients data with corresponding predictions
data.frame(age = test_set$age,
bmi = test_set$bmi,
gender = test_set$gender,
glucose_apache = test_set$glucose_apache,
h1_glucose_max = test_set$h1_glucose_max,
d1_glucose_max = test_set$d1_glucose_max,
ethnicity = test_set$ethnicity,
wbc_apache = test_set$wbc_apache,
prediction = forest_pred,
truth_value = test_set$outcome_variable) %>%
head(30) %>%
  kable()
| age | bmi | gender | glucose_apache | h1_glucose_max | d1_glucose_max | ethnicity | wbc_apache | prediction | truth_value |
|---|---|---|---|---|---|---|---|---|---|
| 77 | 27.42000 | 0 | 145.0 | 145.000 | 145 | Caucasian | 12.7 | 0 | 1 |
| 19 | 27.65432 | 1 | 133.0 | 140.000 | 150 | Caucasian | 10.4 | 0 | 0 |
| 67 | 27.56000 | 1 | 156.0 | 140.000 | 156 | Caucasian | 10.9 | 0 | 1 |
| 50 | 25.71000 | 1 | 134.0 | 134.000 | 134 | Caucasian | 8.4 | 0 | 0 |
| 81 | 38.18907 | 1 | 120.0 | 121.000 | 121 | Caucasian | 6.9 | 0 | 0 |
| 79 | 23.40898 | 0 | 175.0 | 140.000 | 175 | Caucasian | 13.5 | 0 | 0 |
| 65 | 27.65432 | 0 | 205.0 | 140.000 | 208 | Caucasian | 10.1 | 0 | 1 |
| 60 | 26.48571 | 1 | 149.0 | 140.000 | 149 | African American | 7.5 | 0 | 0 |
| 71 | 34.30569 | 0 | 157.0 | 140.000 | 157 | Caucasian | 5.4 | 0 | 0 |
| 58 | 27.65432 | 0 | 88.0 | 140.000 | 88 | Caucasian | 6.7 | 0 | 0 |
| 85 | 21.88964 | 0 | 81.0 | 140.000 | 81 | Caucasian | 6.0 | 0 | 0 |
| 65 | 26.28642 | 1 | 361.0 | 140.000 | 361 | African American | 10.4 | 1 | 0 |
| 82 | 23.73812 | 1 | 121.0 | 140.000 | 121 | Caucasian | 8.5 | 0 | 0 |
| 64 | 16.98039 | 1 | 96.0 | 140.000 | 113 | Caucasian | 21.8 | 0 | 0 |
| 59 | 14.84493 | 0 | 71.0 | 140.000 | 182 | African American | 1.8 | 0 | 0 |
| 55 | 32.20536 | 1 | 81.0 | 95.000 | 95 | Caucasian | 8.2 | 0 | 0 |
| 62 | 27.26740 | 1 | 72.0 | 87.000 | 87 | Caucasian | 3.5 | 0 | 0 |
| 56 | 34.15440 | 0 | 133.0 | 140.000 | 150 | Caucasian | 10.4 | 0 | 0 |
| 73 | 20.35197 | 0 | 159.0 | 140.000 | 159 | Caucasian | 15.6 | 0 | 0 |
| 48 | 26.51648 | 1 | 96.0 | 140.000 | 96 | Caucasian | 5.6 | 0 | 0 |
| 65 | 33.00135 | 0 | 94.0 | 140.000 | 94 | Caucasian | 10.4 | 0 | 0 |
| 55 | 24.12382 | 1 | 114.0 | 140.000 | 114 | Caucasian | 4.1 | 0 | 0 |
| 71 | 35.07812 | 0 | 181.0 | 140.000 | 181 | Caucasian | 12.1 | 0 | 1 |
| 47 | 28.86719 | 0 | 76.0 | 82.000 | 121 | Asian | 4.7 | 0 | 0 |
| 75 | 27.65432 | 0 | 598.7 | 695.045 | 611 | Caucasian | 11.4 | 1 | 1 |
| 55 | 23.74531 | 0 | 160.0 | 192.000 | 192 | African American | 16.2 | 0 | 0 |
| 74 | 27.65432 | 0 | 324.0 | 140.000 | 324 | Caucasian | 18.0 | 1 | 1 |
| 68 | 18.45329 | 0 | 260.0 | 291.000 | 291 | Caucasian | 27.6 | 0 | 0 |
| 89 | 27.25875 | 1 | 106.0 | 140.000 | 106 | Caucasian | 15.4 | 0 | 0 |
| 69 | 34.75311 | 0 | 133.0 | 140.000 | 145 | Caucasian | 10.4 | 0 | 0 |
# Get the probabilities for our forest predictions
forest_probs <- predict(forest, newdata = test_set, type = "prob")
forest_probs <- forest_probs[, 2]
# Create a "prediction" object using our probabilities and the outcome variable
forest_roc <- prediction(forest_probs, test_set$outcome_variable)
forest_perf <- performance(forest_roc, measure = "tpr", x.measure = "fpr")
# Plot the ROC curve
plot(forest_perf, col = rainbow(10))
abline(a = 0, b = 1)
# Calculate the AUC
auc_forest <- performance(forest_roc, measure = "auc")
auc_forest <- auc_forest@y.values[[1]]
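As a sanity check on ROCR’s AUC, the same quantity can be computed directly in base R as the probability that a randomly chosen positive case receives a higher predicted probability than a randomly chosen negative one (the Wilcoxon rank-sum form). A sketch on toy values:

```r
# AUC via the rank-sum identity; probs are predicted probabilities of
# class 1, labels are the true 0/1 outcomes (toy values).
auc_manual <- function(probs, labels) {
  r <- rank(probs)
  n_pos <- sum(labels == 1)
  n_neg <- sum(labels == 0)
  (sum(r[labels == 1]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}
auc_manual(c(0.9, 0.8, 0.3, 0.1), c(1, 0, 1, 0))   # 0.75
```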
# Ground truth is recorded in the GOSSIS data
cm = confusionMatrix(forest_pred,
test_set$outcome_variable,
positive = "1")$tab
Reference <- factor(c(0, 0, 1, 1))
Prediction <- factor(c(0, 1, 0, 1))
Y <- c(cm[1], cm[2], cm[3], cm[4])
df_2 <- data.frame(Reference, Prediction, Y)
ggplot(data = df_2, mapping = aes(x = Reference, y = Prediction)) + ggtitle("Overall prediction")+
geom_tile(aes(fill = Y), colour = "white") +
geom_text(aes(label = sprintf("%1.0f", Y)), vjust = .5, fontface = "bold", alpha = 1) +
scale_fill_gradient(low="white", high="#009194") +
theme_bw() + theme(legend.position = "none") + coord_fixed()
accuracy <- (cm[1] + cm[4]) / (cm[1] + cm[2] + cm[3] + cm[4])
recall <- cm[4] / (cm[4] + cm[3])        # TP / (TP + FN)
specificity <- cm[1] / (cm[1] + cm[2])   # TN / (TN + FP)
print("auc:")
## [1] "auc:"
## [1] 0.8130455
## [1] "accuracy:"
## [1] 0.8068026
## [1] "recall:"
## [1] 0.612442
## [1] "specificity:"
## [1] 0.8390238
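One indexing detail worth spelling out: confusionMatrix() puts predictions in rows and the reference in columns, and R indexes the 2×2 table column-major, so cm[1] is the true-negative cell, cm[2] false positives, cm[3] false negatives, and cm[4] true positives. A base-R check on toy labels:

```r
# Reproduce the 2x2 layout with table(): rows = Prediction, cols = Reference.
pred  <- factor(c(0, 0, 1, 1, 1, 0), levels = c(0, 1))
truth <- factor(c(0, 1, 1, 1, 0, 0), levels = c(0, 1))
cm <- table(Prediction = pred, Reference = truth)
# Column-major indexing: cm[1] = TN, cm[2] = FP, cm[3] = FN, cm[4] = TP
recall      <- cm[4] / (cm[4] + cm[3])   # TP / (TP + FN)
specificity <- cm[1] / (cm[1] + cm[2])   # TN / (TN + FP)
c(recall, specificity)   # both 2/3 on this toy example
```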
library(ggplot2)
# Initialize a frame to collect per-subgroup metrics for the summary plots
df_result <- data.frame()
var_list <- c("gender", "ethnicity")
for (var in var_list)
{
print(var)
list <- unique(original_test_set[[var]])
for (i in list)
{
test_set <- original_test_set[original_test_set[[var]] == i, ]
num = dim(test_set)[1]
# Make predictions on test set
forest_pred <- predict(forest,
newdata = test_set,
type = "raw")
data.frame(age = test_set$age,
bmi = test_set$bmi,
gender = test_set$gender,
glucose_apache = test_set$glucose_apache,
h1_glucose_max = test_set$h1_glucose_max,
d1_glucose_max = test_set$d1_glucose_max,
ethnicity = test_set$ethnicity,
wbc_apache = test_set$wbc_apache,
prediction = forest_pred,
truth_value = test_set$outcome_variable)
# Get the probabilities for our forest predictions
forest_probs <- predict(forest, newdata = test_set, type = "prob")
forest_probs <- forest_probs[, 2]
# Create a "prediction" object using our probabilities and the outcome variable
forest_roc <- prediction(forest_probs, test_set$outcome_variable)
forest_perf <- performance(forest_roc, measure = "tpr", x.measure = "fpr")
# Plot the ROC curve
plot(forest_perf, col = rainbow(10), main= paste(var,":",i,"(",num,")"))
abline(a = 0, b = 1)
# Calculate the AUC
auc_forest <- performance(forest_roc, measure = "auc")
auc_forest <- auc_forest@y.values[[1]]
# Ground truth is recorded in the GOSSIS data
cm = confusionMatrix(forest_pred,
test_set$outcome_variable,
positive = "1")$tab
Reference <- factor(c(0, 0, 1, 1))
Prediction <- factor(c(0, 1, 0, 1))
Y <- c(cm[1], cm[2], cm[3], cm[4])
df_2 <- data.frame(Reference, Prediction, Y)
print(ggplot(data = df_2, mapping = aes(x = Reference, y = Prediction)) + ggtitle(paste(var,":",i,"(",num,")")) +
geom_tile(aes(fill = Y), colour = "white") +
geom_text(aes(label = sprintf("%1.0f", Y)), vjust = .5, fontface = "bold", alpha = 1) +
scale_fill_gradient(low="white", high="#009194") +
theme_bw() + theme(legend.position = "none")+ coord_fixed())
accuracy <- (cm[1] + cm[4])/(cm[1] + cm[2] + cm[3] + cm[4])
recall <- cm[4] / (cm[4] + cm[3])        # TP / (TP + FN)
specificity <- cm[1] / (cm[1] + cm[2])   # TN / (TN + FP)
df_result <- rbind(df_result, data.frame(name = var, auc = auc_forest,accuracy = accuracy,recall = recall,specificity = specificity, range = i))
print(paste(i,"(", num, "):"))
print("auc:")
print(auc_forest)
print("accuracy:")
print(accuracy)
print("recall:")
print(recall)
print("specificity:")
print(specificity)
}
}
## [1] "gender"
## [1] "0 ( 8414 ):"
## [1] "auc:"
## [1] 0.8209621
## [1] "accuracy:"
## [1] 0.8168529
## [1] "recall:"
## [1] 0.6338148
## [1] "specificity:"
## [1] 0.8466215
## [1] "1 ( 9785 ):"
## [1] "auc:"
## [1] 0.8059802
## [1] "accuracy:"
## [1] 0.7985692
## [1] "recall:"
## [1] 0.5961675
## [1] "specificity:"
## [1] 0.832617
## [1] "ethnicity"
## [1] "Caucasian ( 14328 ):"
## [1] "auc:"
## [1] 0.8179748
## [1] "accuracy:"
## [1] 0.8161642
## [1] "recall:"
## [1] 0.6157135
## [1] "specificity:"
## [1] 0.8462712
## [1] "African American ( 1864 ):"
## [1] "auc:"
## [1] 0.777936
## [1] "accuracy:"
## [1] 0.7757511
## [1] "recall:"
## [1] 0.5917722
## [1] "specificity:"
## [1] 0.8133075
## [1] "Asian ( 245 ):"
## [1] "auc:"
## [1] 0.7922566
## [1] "accuracy:"
## [1] 0.7632653
## [1] "recall:"
## [1] 0.5833333
## [1] "specificity:"
## [1] 0.7942584
## [1] "Hispanic ( 735 ):"
## [1] "auc:"
## [1] 0.814394
## [1] "accuracy:"
## [1] 0.7768707
## [1] "recall:"
## [1] 0.5986395
## [1] "specificity:"
## [1] 0.8214286
## [1] "Native American ( 145 ):"
## [1] "auc:"
## [1] 0.8129614
## [1] "accuracy:"
## [1] 0.737931
## [1] "recall:"
## [1] 0.6285714
## [1] "specificity:"
## [1] 0.7727273
## [1] "Other/Unknown ( 882 ):"
## [1] "auc:"
## [1] 0.8015829
## [1] "accuracy:"
## [1] 0.7721088
## [1] "recall:"
## [1] 0.6373626
## [1] "specificity:"
## [1] 0.8071429
df <- data.frame(name = "age", range = c(20, 40, 60, 80))
df <- rbind(df, data.frame(name = "bmi", range = c(20, 30, 40)))
df <- rbind(df, data.frame(name = "glucose_apache", range = c(100, 200, 300)))
df <- rbind(df, data.frame(name = "h1_glucose_max", range = c(100, 200, 300)))
df <- rbind(df, data.frame(name = "d1_glucose_max", range = c(100, 200, 300)))
df <- rbind(df, data.frame(name = "wbc_apache", range = c(5, 10, 15, 20)))
var_list <- c("age", "bmi", "glucose_apache", "h1_glucose_max",
              "d1_glucose_max", "wbc_apache")
for (var in var_list)
{
print(var)
var_range = df[df$name==var,"range"]
for (i in 1:(length(var_range) + 1))
{
if(i==1){
test_set <- original_test_set[original_test_set[[var]] < var_range[i], ]
}
else if(i==length(var_range)+1){
test_set <- original_test_set[original_test_set[[var]] > var_range[i-1], ]
}
else{
test_set <- original_test_set[original_test_set[[var]] > var_range[i-1] & original_test_set[[var]] < var_range[i], ]
}
num = dim(test_set)[1]
if(i==1){
range_text = paste("[", i, "]", "<", var_range[i], "(",num, ")")
}
else if(i==length(var_range)+1){
range_text = paste("[", i, "]", ">", var_range[i-1], "(",num, ")")
}
else{
range_text = paste("[", i, "]", var_range[i-1],"~", var_range[i], "(",num, ")")
}
# Make predictions on test set
forest_pred <- predict(forest,
newdata = test_set,
type = "raw")
data.frame(age = test_set$age,
bmi = test_set$bmi,
gender = test_set$gender,
glucose_apache = test_set$glucose_apache,
h1_glucose_max = test_set$h1_glucose_max,
d1_glucose_max = test_set$d1_glucose_max,
ethnicity = test_set$ethnicity,
wbc_apache = test_set$wbc_apache,
prediction = forest_pred,
truth_value = test_set$outcome_variable)
# Get the probabilities for our forest predictions
forest_probs <- predict(forest, newdata = test_set, type = "prob")
forest_probs <- forest_probs[, 2]
# Create a "prediction" object using our probabilities and the outcome variable
forest_roc <- prediction(forest_probs, test_set$outcome_variable)
forest_perf <- performance(forest_roc, measure = "tpr", x.measure = "fpr")
# Plot the ROC curve
plot(forest_perf, col = rainbow(10), main= paste(var,":",range_text))
abline(a = 0, b = 1)
# Calculate the AUC
auc_forest <- performance(forest_roc, measure = "auc")
auc_forest <- auc_forest@y.values[[1]]
print(paste(range_text,":"))
# Ground truth is recorded in the GOSSIS data
cm = confusionMatrix(forest_pred,
test_set$outcome_variable,
positive = "1")$tab
Reference <- factor(c(0, 0, 1, 1))
Prediction <- factor(c(0, 1, 0, 1))
Y <- c(cm[1], cm[2], cm[3], cm[4])
df_2 <- data.frame(Reference, Prediction, Y)
print(ggplot(data = df_2, mapping = aes(x = Reference, y = Prediction)) + ggtitle(range_text) +
geom_tile(aes(fill = Y), colour = "white") +
geom_text(aes(label = sprintf("%1.0f", Y)), vjust = .5, fontface = "bold", alpha = 1) +
scale_fill_gradient(low="white", high="#009194") +
theme_bw() + theme(legend.position = "none")+ coord_fixed())
accuracy <- (cm[1] + cm[4])/(cm[1] + cm[2] + cm[3] + cm[4])
recall <- cm[4] / (cm[4] + cm[3])        # TP / (TP + FN)
specificity <- cm[1] / (cm[1] + cm[2])   # TN / (TN + FP)
df_result <- rbind(df_result, data.frame(name = var, auc = auc_forest,accuracy = accuracy,recall = recall,specificity = specificity, range = range_text))
print("auc:")
print(auc_forest)
print("accuracy:")
print(accuracy)
print("recall:")
print(recall)
print("specificity:")
print(specificity)
}
}
## [1] "age"
## [1] "[ 1 ] < 20 ( 160 ) :"
## [1] "auc:"
## [1] 0.9911937
## [1] "accuracy:"
## [1] 0.9625
## [1] "recall:"
## [1] 0.7
## [1] "specificity:"
## [1] 1
## [1] "[ 2 ] 20 ~ 40 ( 1653 ) :"
## [1] "auc:"
## [1] 0.9303992
## [1] "accuracy:"
## [1] 0.9122807
## [1] "recall:"
## [1] 0.6790123
## [1] "specificity:"
## [1] 0.9376258
## [1] "[ 3 ] 40 ~ 60 ( 4691 ) :"
## [1] "auc:"
## [1] 0.8298128
## [1] "accuracy:"
## [1] 0.8166702
## [1] "recall:"
## [1] 0.5799419
## [1] "specificity:"
## [1] 0.857357
## [1] "[ 4 ] 60 ~ 80 ( 8360 ) :"
## [1] "auc:"
## [1] 0.7820416
## [1] "accuracy:"
## [1] 0.7767943
## [1] "recall:"
## [1] 0.6246334
## [1] "specificity:"
## [1] 0.8064608
## [1] "[ 5 ] > 80 ( 2420 ) :"
## [1] "auc:"
## [1] 0.773025
## [1] "accuracy:"
## [1] 0.8144628
## [1] "recall:"
## [1] 0.6103896
## [1] "specificity:"
## [1] 0.8359982
## [1] "bmi"
## [1] "[ 1 ] < 20 ( 1426 ) :"
## [1] "auc:"
## [1] 0.834415
## [1] "accuracy:"
## [1] 0.8870968
## [1] "recall:"
## [1] 0.5042735
## [1] "specificity:"
## [1] 0.921314
## [1] "[ 2 ] 20 ~ 30 ( 10204 ) :"
## [1] "auc:"
## [1] 0.815502
## [1] "accuracy:"
## [1] 0.8360447
## [1] "recall:"
## [1] 0.6160991
## [1] "specificity:"
## [1] 0.8591229
## [1] "[ 3 ] 30 ~ 40 ( 4930 ) :"
## [1] "auc:"
## [1] 0.7814041
## [1] "accuracy:"
## [1] 0.7561866
## [1] "recall:"
## [1] 0.6107249
## [1] "specificity:"
## [1] 0.7935254
## [1] "[ 4 ] > 40 ( 1636 ) :"
## [1] "auc:"
## [1] 0.7548043
## [1] "accuracy:"
## [1] 0.7114914
## [1] "recall:"
## [1] 0.6424242
## [1] "specificity:"
## [1] 0.7414549
## [1] "glucose_apache"
## [1] "[ 1 ] < 100 ( 4572 ) :"
## [1] "auc:"
## [1] 0.7937578
## [1] "accuracy:"
## [1] 0.861986
## [1] "recall:"
## [1] 0.6477273
## [1] "specificity:"
## [1] 0.8661909
## [1] "[ 2 ] 100 ~ 200 ( 9650 ) :"
## [1] "auc:"
## [1] 0.753888
## [1] "accuracy:"
## [1] 0.8596891
## [1] "recall:"
## [1] 0.5737705
## [1] "specificity:"
## [1] 0.8671061
## [1] "[ 3 ] 200 ~ 300 ( 2503 ) :"
## [1] "auc:"
## [1] 0.6470318
## [1] "accuracy:"
## [1] 0.6048742
## [1] "recall:"
## [1] 0.619211
## [1] "specificity:"
## [1] 0.592371
## [1] "[ 4 ] > 300 ( 1269 ) :"
## [1] "auc:"
## [1] 0.5876709
## [1] "accuracy:"
## [1] 0.6052009
## [1] "recall:"
## [1] 0.6152399
## [1] "specificity:"
## [1] 0.5533981
## [1] "h1_glucose_max"
## [1] "[ 1 ] < 100 ( 1193 ) :"
## [1] "auc:"
## [1] 0.8047401
## [1] "accuracy:"
## [1] 0.8365465
## [1] "recall:"
## [1] 0.6027397
## [1] "specificity:"
## [1] 0.8517857
## [1] "[ 2 ] 100 ~ 200 ( 15255 ) :"
## [1] "auc:"
## [1] 0.7951695
## [1] "accuracy:"
## [1] 0.8254998
## [1] "recall:"
## [1] 0.582517
## [1] "specificity:"
## [1] 0.8486502
## [1] "[ 3 ] 200 ~ 300 ( 1029 ) :"
## [1] "auc:"
## [1] 0.626052
## [1] "accuracy:"
## [1] 0.5986395
## [1] "recall:"
## [1] 0.6352941
## [1] "specificity:"
## [1] 0.5483871
## [1] "[ 4 ] > 300 ( 636 ) :"
## [1] "auc:"
## [1] 0.5722972
## [1] "accuracy:"
## [1] 0.6509434
## [1] "recall:"
## [1] 0.6672504
## [1] "specificity:"
## [1] 0.5076923
## [1] "d1_glucose_max"
## [1] "[ 1 ] < 100 ( 1889 ) :"
## [1] "auc:"
## [1] 0.7760518
## [1] "accuracy:"
## [1] 0.9534145
## [1] "recall:"
## [1] 1
## [1] "specificity:"
## [1] 0.9533898
## [1] "[ 2 ] 100 ~ 200 ( 11849 ) :"
## [1] "auc:"
## [1] 0.7331695
## [1] "accuracy:"
## [1] 0.8575407
## [1] "recall:"
## [1] 0.5461538
## [1] "specificity:"
## [1] 0.860995
## [1] "[ 3 ] 200 ~ 300 ( 2830 ) :"
## [1] "auc:"
## [1] 0.6437554
## [1] "accuracy:"
## [1] 0.5968198
## [1] "recall:"
## [1] 0.609736
## [1] "specificity:"
## [1] 0.5871446
## [1] "[ 4 ] > 300 ( 1456 ) :"
## [1] "auc:"
## [1] 0.5759749
## [1] "accuracy:"
## [1] 0.6112637
## [1] "recall:"
## [1] 0.625817
## [1] "specificity:"
## [1] 0.5344828
## [1] "wbc_apache"
## [1] "[ 1 ] < 5 ( 994 ) :"
## [1] "auc:"
## [1] 0.7947097
## [1] "accuracy:"
## [1] 0.8138833
## [1] "recall:"
## [1] 0.6168224
## [1] "specificity:"
## [1] 0.837655
## [1] "[ 2 ] 5 ~ 10 ( 5506 ) :"
## [1] "auc:"
## [1] 0.8156433
## [1] "accuracy:"
## [1] 0.806393
## [1] "recall:"
## [1] 0.6215236
## [1] "specificity:"
## [1] 0.8390682
## [1] "[ 3 ] 10 ~ 15 ( 7902 ) :"
## [1] "auc:"
## [1] 0.8221807
## [1] "accuracy:"
## [1] 0.8171349
## [1] "recall:"
## [1] 0.6213848
## [1] "specificity:"
## [1] 0.8501701
## [1] "[ 4 ] 15 ~ 20 ( 1960 ) :"
## [1] "auc:"
## [1] 0.792613
## [1] "accuracy:"
## [1] 0.7897959
## [1] "recall:"
## [1] 0.57
## [1] "specificity:"
## [1] 0.8295181
## [1] "[ 5 ] > 20 ( 1602 ) :"
## [1] "auc:"
## [1] 0.794275
## [1] "accuracy:"
## [1] 0.7734082
## [1] "recall:"
## [1] 0.5804598
## [1] "specificity:"
## [1] 0.7969188
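A side note on the interval logic above: because the loop uses strict < and > comparisons, observations falling exactly on a boundary (e.g. age exactly 20) are silently dropped from every bin. Base R’s cut() avoids this and expresses the binning more compactly; a sketch on hypothetical ages:

```r
# Binning hypothetical ages with cut(); right = FALSE gives half-open
# intervals [lo, hi), so boundary values land in exactly one bin.
ages   <- c(15, 20, 35, 60, 81)
breaks <- c(-Inf, 20, 40, 60, 80, Inf)
bins   <- cut(ages, breaks = breaks, right = FALSE)
table(bins)   # age 20 falls in [20,40) rather than being dropped
```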
var_list <- c("gender","ethnicity","age","bmi","glucose_apache", "h1_glucose_max", "d1_glucose_max", "wbc_apache")
for (var in var_list)
{
tmp <- df_result %>% filter(name == var)
print(ggplot(tmp, aes(range, auc)) +
xlab(var) + ylab("AUC") +
geom_bar(stat = "identity"))
print("Gap:")
print(max(tmp$auc)-min(tmp$auc))
}
## [1] "Gap:"
## [1] 0.01498192
## [1] "Gap:"
## [1] 0.04003881
## [1] "Gap:"
## [1] 0.2181687
## [1] "Gap:"
## [1] 0.07961078
## [1] "Gap:"
## [1] 0.2060869
## [1] "Gap:"
## [1] 0.2324429
## [1] "Gap:"
## [1] 0.2000769
## [1] "Gap:"
## [1] 0.02956778
for (var in var_list)
{
tmp <- df_result %>% filter(name == var)
print(ggplot(tmp, aes(range, accuracy)) +
xlab(var) + ylab("Accuracy") +
geom_bar(stat = "identity"))
print("Gap:")
print(max(tmp$accuracy)-min(tmp$accuracy))
}
## [1] "Gap:"
## [1] 0.01828363
## [1] "Gap:"
## [1] 0.07823312
## [1] "Gap:"
## [1] 0.1857057
## [1] "Gap:"
## [1] 0.1756053
## [1] "Gap:"
## [1] 0.2571119
## [1] "Gap:"
## [1] 0.2379071
## [1] "Gap:"
## [1] 0.3565947
## [1] "Gap:"
## [1] 0.04372666
for (var in var_list)
{
tmp <- df_result %>% filter(name == var)
print(ggplot(tmp, aes(range, recall)) +
xlab(var) + ylab("Recall") +
geom_bar(stat = "identity"))
print("Gap:")
print(max(tmp$recall)-min(tmp$recall))
}
## [1] "Gap:"
## [1] 0.03764729
## [1] "Gap:"
## [1] 0.0540293
## [1] "Gap:"
## [1] 0.1200581
## [1] "Gap:"
## [1] 0.1381507
## [1] "Gap:"
## [1] 0.07395678
## [1] "Gap:"
## [1] 0.08473348
## [1] "Gap:"
## [1] 0.4538462
## [1] "Gap:"
## [1] 0.05152358
for (var in var_list)
{
tmp <- df_result %>% filter(name == var)
print(ggplot(tmp, aes(range, specificity)) +
xlab(var) + ylab("Specificity") +
geom_bar(stat = "identity"))
print("Gap:")
print(max(tmp$specificity)-min(tmp$specificity))
}
## [1] "Gap:"
## [1] 0.01400453
## [1] "Gap:"
## [1] 0.0735439
## [1] "Gap:"
## [1] 0.1935392
## [1] "Gap:"
## [1] 0.1798591
## [1] "Gap:"
## [1] 0.313708
## [1] "Gap:"
## [1] 0.3440934
## [1] "Gap:"
## [1] 0.4189071
## [1] "Gap:"
## [1] 0.05325133
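The “Gap” printed throughout is simply the spread of a metric across subgroups, i.e. the best minus the worst subgroup value. On hypothetical per-subgroup AUCs:

```r
# Hypothetical per-subgroup AUCs; the gap is max minus min.
aucs <- c(groupA = 0.82, groupB = 0.78, groupC = 0.81)
gap  <- max(aucs) - min(aucs)
round(gap, 2)   # 0.04
```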
Please see attached separate R Markdown notebooks.